{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "3.7 合并数据集：Concat与Append操作"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "def make_df(cols, ind):            \n",
    "    \"\"\"一个简单的DataFrame\"\"\"            \n",
    "    data = {c: [str(c) + str(i) for i in ind]  for c in cols}           \n",
    "    return pd.DataFrame(data, ind) \n",
    "make_df('ABC', range(3))"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "      <td>C0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "      <td>C2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B   C\n",
       "0  A0  B0  C0\n",
       "1  A1  B1  C1\n",
       "2  A2  B2  C2"
      ]
     },
     "metadata": {},
     "execution_count": 1
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "3.7.2 通过pd.concat实现简易合并  "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])       \n",
    "ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])        \n",
    "pd.concat([ser1, ser2])"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "1    A\n",
       "2    B\n",
       "3    C\n",
       "4    D\n",
       "5    E\n",
       "6    F\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "source": [
    "df1 = make_df('AB', [1, 2])        \n",
    "df2 = make_df('AB', [3, 4])        \n",
    "print(df1); print(df2); print(pd.concat([df1, df2])) \n"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B\n",
      "1  A1  B1\n",
      "2  A2  B2\n",
      "    A   B\n",
      "3  A3  B3\n",
      "4  A4  B4\n",
      "    A   B\n",
      "1  A1  B1\n",
      "2  A2  B2\n",
      "3  A3  B3\n",
      "4  A4  B4\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "默认情况下，DataFrame的合并都是逐行进行的（默认设置是axis=0）。 与np.concatenate()一样，pd.concat也可以设置合并坐标轴，例如下面的示例："
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "source": [
    "df3 = make_df('AB', [0, 1])        \n",
    "df4 = make_df('CD', [0, 1])        \n",
    "print(df3); print(df4); print(pd.concat([df3, df4], axis=1))"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B\n",
      "0  A0  B0\n",
      "1  A1  B1\n",
      "    C   D\n",
      "0  C0  D0\n",
      "1  C1  D1\n",
      "    A   B   C   D\n",
      "0  A0  B0  C0  D0\n",
      "1  A1  B1  C1  D1\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "1. 索引重复  \n",
    "np.concatenate与pd.concat最主要的差异之一就是Pandas在合并时会保留索引，即使索引是重复的！  "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "source": [
    "x = make_df('AB', [0, 1])        \n",
    "y = make_df('AB', [2, 3])        \n",
    "y.index = x.index  # 复制索引       \n",
    "print(x); print(y); print(pd.concat([x, y])) "
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B\n",
      "0  A0  B0\n",
      "1  A1  B1\n",
      "    A   B\n",
      "0  A2  B2\n",
      "1  A3  B3\n",
      "    A   B\n",
      "0  A0  B0\n",
      "1  A1  B1\n",
      "0  A2  B2\n",
      "1  A3  B3\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "(1) 捕捉索引重复的错误。如果你想要检测pd.concat()合并的结果中是否出现了重复的索引，可以设置verify_integrity参数。将参数设置为True，合并时若有索引重复就会触发异常。  "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "source": [
    "try:             \n",
    "    pd.concat([x, y], verify_integrity=True)         \n",
    "except ValueError as e:             \n",
    "        print(\"ValueError:\", e) "
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "2) 忽略索引。有时索引无关紧要，那么合并时就可以忽略它们，可以通过设置ignore_index参数来实现。如果将参数设置为True，那么合并时将会创建一个新的整数索引。\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "source": [
    "print(x); print(y); print(pd.concat([x, y], ignore_index=True))"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B\n",
      "0  A0  B0\n",
      "1  A1  B1\n",
      "    A   B\n",
      "0  A2  B2\n",
      "1  A3  B3\n",
      "    A   B\n",
      "0  A0  B0\n",
      "1  A1  B1\n",
      "2  A2  B2\n",
      "3  A3  B3\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "(3) 增加多级索引。另一种处理索引重复的方法是通过keys参数为数据源设置多级索引标签，这样结果数据就会带上多级索引："
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "source": [
    "print(x); print(y); print(pd.concat([x, y], keys=['x', 'y'])) "
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B\n",
      "0  A0  B0\n",
      "1  A1  B1\n",
      "    A   B\n",
      "0  A2  B2\n",
      "1  A3  B3\n",
      "      A   B\n",
      "x 0  A0  B0\n",
      "  1  A1  B1\n",
      "y 0  A2  B2\n",
      "  1  A3  B3\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "2. 类似join的合并  \n",
    "前面介绍的简单示例都有一个共同特点，那就是合并的DataFrame都是同样的列名。而在实际工作中，需要合并的数据往往带有不同的列名，而pd.concat提供了一些选项来解决这类合并问题。  "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "source": [
    "df5 = make_df('ABC', [1, 2])         \n",
    "df6 = make_df('BCD', [3, 4])         \n",
    "print(df5); print(df6); print(pd.concat([df5, df6]) )"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B   C\n",
      "1  A1  B1  C1\n",
      "2  A2  B2  C2\n",
      "    B   C   D\n",
      "3  B3  C3  D3\n",
      "4  B4  C4  D4\n",
      "     A   B   C    D\n",
      "1   A1  B1  C1  NaN\n",
      "2   A2  B2  C2  NaN\n",
      "3  NaN  B3  C3   D3\n",
      "4  NaN  B4  C4   D4\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "默认情况下，某个位置上缺失的数据会用NaN表示。如果不想这样，可以用join和join_axes参数设置合并方式。默认的合并方式是对所有输入列进行并集合并（join='outer'），当然也可以用join='inner'实现对输入列的交集合并："
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "source": [
    "print(df5); print(df6);         \n",
    "print(pd.concat([df5, df6], join='inner')) "
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B   C\n",
      "1  A1  B1  C1\n",
      "2  A2  B2  C2\n",
      "    B   C   D\n",
      "3  B3  C3  D3\n",
      "4  B4  C4  D4\n",
      "    B   C\n",
      "1  B1  C1\n",
      "2  B2  C2\n",
      "3  B3  C3\n",
      "4  B4  C4\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "source": [
    "print(df5); print(df6);         \n",
    "# print(pd.concat([df5, df6], join_axes=[df5.columns]))  #出错了？ "
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B   C\n",
      "1  A1  B1  C1\n",
      "2  A2  B2  C2\n",
      "    B   C   D\n",
      "3  B3  C3  D3\n",
      "4  B4  C4  D4\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "3. append()方法"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "source": [
    "print(df1); print(df2); print(df1.append(df2))"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "    A   B\n",
      "1  A1  B1\n",
      "2  A2  B2\n",
      "    A   B\n",
      "3  A3  B3\n",
      "4  A4  B4\n",
      "    A   B\n",
      "1  A1  B1\n",
      "2  A2  B2\n",
      "3  A3  B3\n",
      "4  A4  B4\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "orig_nbformat": 4,
  "language_info": {
   "name": "python",
   "version": "3.8.10",
   "mimetype": "text/x-python",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "pygments_lexer": "ipython3",
   "nbconvert_exporter": "python",
   "file_extension": ".py"
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.8.10 64-bit ('d2l-zh': conda)"
  },
  "interpreter": {
   "hash": "0bedf3452ed48e38a2b75ff9f66ce8ae9ae2e92e8665cb618ee0797d2f46b9e5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}